Skip to content

Html api docs improvement#61

Draft
sirreal wants to merge 193 commits into
trunkfrom
html-api-docs-improvement
Draft

Html api docs improvement#61
sirreal wants to merge 193 commits into
trunkfrom
html-api-docs-improvement

Conversation

@sirreal

@sirreal sirreal commented Jun 12, 2026

Copy link
Copy Markdown
Owner

Trac ticket:

Use of AI Tools


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

sirreal added 30 commits June 11, 2026 18:37
Scaffolding for the autonomous documentation-improvement loop:
- PLAN.md records the full agreed design (corpus, scoring, isolation,
  harness, round flow, revert and stopping rules).
- render-docs-markdown.py deterministically renders phpdoc-parser JSON
  to agent-readable markdown, excluding implementation leakage.
16 tasks (12 train + 4 held-out), each with a subagent-facing prompt,
a validated reference implementation, and frozen hidden test cases.
Expected outputs were generated from the references and cross-checked
against PHP's Dom\HTMLDocument where semantics overlap (text
extraction, links, tables, outlines) — all agree.

Harness executes candidates standalone (no WordPress boot) with shims
for the six WP functions the html-api files reference; each test case
runs in an isolated subprocess with a 10s timeout so parse errors,
fatals, and infinite loops are contained and reported.
- stage-round.sh: regenerate JSON, render markdown, stage isolated
  scratch dir containing only the two markdown files.
- docs-only-guard.php: comment-stripped token-stream identity vs HEAD
  plus php -l, run before every round that follows doc edits.
- aggregate-round.py: trial/task/round scoring per PLAN.md formula.
- PROTOCOL.md: runbook with exact test-subagent and judge prompt
  templates, judge rubric, and results layout.
- docs-test-subject agent definition (Read+Grep only) for structural
  isolation in future sessions.

Pilot validated end-to-end: Sonnet test subject on T01 returned
well-formed output passing 8/8 hidden cases.
trials-workflow.js fans out one docs-only test subject per task-trial
with structured output; judge-workflow.js fans out one Opus judge per
task with the adherence rubric and doc-gap analysis; persist-trials.py
writes candidates to results/ and executes them against hidden tests.
48 Sonnet trials (16 tasks x 3) judged by 16 Opus judges.
TRAIN 93.57 / HELD-OUT 93.47. Dominant systematic failure: undocumented
closer-token depth semantics plus missing subtree-walk idiom (T03, T06,
H04). Secondary: get_modifiable_text() decoding unstated (T08, H04);
serialize_token() rewrite idiom undocumented (T12); misleading
tables-unsupported bullet (T08).
Round-0 failures in T03, T06, and held-out H04 shared one root cause:
nothing documents that a closing-tag token reports the PARENT's depth
(the element is already popped when matched on its closer). All three
T03 trials lost trailing text after nested elements by breaking their
walk loops at 'depth <= opener depth'.

get_current_depth(): state the closer rule explicitly, define depth as
breadcrumb count including non-element tokens, extend the existing
example through the closing tokens, and add the canonical
visit-every-token-inside-an-element loop (depth >= opener depth).
is_tag_closer() (HTML Processor): note that breadcrumbs and depth
reflect the parent context when matched on a closer.
…_token().

The docblock described the method as internal ('do not use') and steered
readers to the Tag Processor 'for access to the raw tokens' — the
opposite of the right guidance for structure-aware text collection,
which round-0 judges identified as a driver of the T06 failures (two of
three trials collected nothing).

Rewrite the description: define tokens, position next_token() as the
right tool when non-tag content matters alongside structure, document
that closers are visited for every opener (including implicit and
end-of-input closes), warn that text may split across consecutive #text
tokens, and add the canonical collect-text-of-an-element example in both
depth-guard and breadcrumbs-guard forms (both verified by execution).
@SInCE history left as-is.
…coded text.

Round-0 judges (T08, H04) flagged that nothing states whether the
returned text has character references decoded — the single most
load-bearing fact for text extraction. Several subjects bolted on a
redundant html_entity_decode() pass, which double-decodes and corrupts
text like '&amp;amp;'.

State the decoding rule with its boundaries (decoded for #text and
RCDATA elements like TEXTAREA/TITLE; verbatim for raw text SCRIPT/STYLE
and comment interiors — all verified by execution), add a one-line
example, and note the set_modifiable_text() inverse so callers work in
decoded space on both sides.
TRAIN 98.78 (+5.21 vs baseline). 36/36 trials passed every hidden case.
T03 +13.95 (closer-depth rule + subtree-walk example), T06 +46.33
(next_token() rehabilitation), no regressions beyond judge noise.
Sonnet has plateaued >=90 for two consecutive rounds; next step per
plan is the Haiku re-baseline. Round-2 adherence targets logged.
Task-first rebalance: add six tasks forcing undercovered concepts
(class removal, contextual selection, truncated-input detection,
normalize() failure handling, full-document parsing, HTML-vs-SVG image
namespace). New held-out set: N01/N02/N05/H04; H01-H03 retired;
T01/T02 relabeled smoke. Every task now carries role/commonness/
concept/processor labels and the aggregator reports per-concept means.
All new references harness-validated; N02/N05/N06 cross-checked
against Dom\HTMLDocument (covering image->img conversion and
img-breaking-out-of-svg).
Two of three Haiku trials on the build-figure task produced correct
markup with src/alt swapped and scored 0/6 — the docs never explain
where set_attribute() puts attributes. Verified by execution: updates
replace in place keeping position; NEW attributes insert after the tag
name before existing ones; multiple new attributes sort by attribute
name regardless of call order. Document all three rules plus the
start-from-a-template idiom for when output order matters.

Also fixes a judge-discovered bug in the paused_at_incomplete_token()
example, which called the nonexistent get_next_tag() instead of
next_tag().
…claims.

The class docblock claimed the HTML Processor cannot process any
element inside a TABLE, any foreign content (SVG/MathML), or anything
outside the IN BODY insertion mode. All three claims are false on this
branch — round-2 trials parsed well-formed tables, SVG content, and
full documents with head content; judges traced T08's defensive
fallback code directly to this passage.

Replace with verified behavior: the processor parses these fine and
aborts only on specific constructs — foster-parented content (e.g. a
DIV directly inside TABLE) and mis-nested formatting requiring
advance-and-rewind reconstruction (e.g. '<b>one<i>two</b>three</i>'),
both confirmed by execution, with simple mis-nesting supported. Also
document how aborts surface: get_last_error(),
get_unsupported_exception(), and null from serialize()/normalize().
Round-1 judges (T12) flagged that nothing connects serialize_token()
to its purpose: subjects mixed token loops with whole-string
normalize(), unsure which was right. Document that concatenating
serialize_token() across a next_token() walk reproduces serialize(),
that the token-by-token form exists for selective rewriting (skip to
remove, emit around to wrap), and that closers of skipped elements
must be skipped too — with an execution-verified removal example.
Cross-reference guidance: serialize() for unchanged output, the loop
for transformations.
All-19 91.47 / core 90.47 / train 92.56 / held-out 87.38. Round-1
edits transfer to Haiku (T03, T06 perfect). Per-concept reporting
exposes the gaps the aggregate hides: attributes 72.2 (set_attribute
ordering), full-document 78.0 (held-out, no edit made), namespace
85.9. Round-3 hypothesis edits committed separately.
ingest-trials.py and ingest-judges.py condense per-round bookkeeping
(persist, execute, aggregate, compare, gap digest with held-out gaps
marked DO-NOT-ACT) into single commands, keeping orchestration
overhead low across the 100+ round goal.
…d edits.

Round 3 confirmed the serialize_token() idiom (round-3 H3) helped its
targets (T09 +8.6, T12 +2.2) but induced a T07 regression (-33.7): two
trials called serialize() after add_class(), got null (scanning had
begun), and returned the unmodified input. Refining rather than
reverting, disclosed in LOG: state the boundary explicitly on both
serialize() and serialize_token() — queued attribute/class/text
updates are read with the inherited get_updated_html(); serialize()
demands a fresh processor and returns null once scanning has begun;
serialization is for normalizing/rewriting, get_updated_html() for
edits.
Two of three round-3 trials on the build-figure task produced empty
captions: they matched the empty FIGCAPTION tag and called
set_modifiable_text(), which returns false there — ordinary container
elements carry no text of their own and an empty element has no #text
token to modify. Nothing documented this. State the eligible token
kinds, the empty-element limitation, the check-the-return-value rule,
and the placeholder-template idiom (verified by execution).
…X idiom.

T10 adherence sat at ~80 because the set_bookmark() docblock forbids
programmatic names without stating the supported alternative; subjects
hedged with bookmark-count workarounds. State explicitly that
re-setting an existing name MOVES the bookmark (no leak, no release
needed) and that same-name-per-match is the idiom for tracking the
last occurrence in one pass (verified by execution; the docblock's
own last-li example already relied on it silently).

Also state the documented default for next_tag()'s tag_closers option
('skip'), which round-3 judges flagged as unstated.
…, two new gaps.

All-19 87.41 / train 90.66 (-1.9) / held-out 75.22. Round-3 edits
helped their targets (T09 +8.6, T12 +2.2, N06 +10.7, N04 100) but the
serialization idiom induced T07 -33.7 (serialize() after mutations).
Refined rather than reverted, with the boundary now stated. Round-4
hypotheses committed separately.
T04 trials each absorbed exactly one of the two template-building facts
(pre-seeded attribute order in set_attribute(), placeholder text in
set_modifiable_text()) and failed on the other — the facts live in two
distant method docblocks. Add a 'Building markup from a template'
section to the class overview, where template builders first look,
stating both rules together with one execution-verified example using
a link template (deliberately unlike any corpus task).
…decoded reads; add_class idempotency.

Judges across four tasks flagged the same unstated guarantees subjects
kept inferring (correctly, but unguided):
- next_tag(): tag-name matching is ASCII case-insensitive with source
  casing preserved; comments/CDATA/rawtext can never match; truncated
  trailing tags are never matched or modified (cross-ref
  paused_at_incomplete_token()). Stated as a 'What this matches' block.
- get_attribute(): string values come back DECODED (don't decode
  again), inverse of set_attribute's encode-on-write.
- add_class(): creates/appends without disturbing existing classes;
  re-adding an existing class is a no-op with an exact byte-for-byte
  duplicate check (add 'NOTE' to class="note" appends — verified;
  an initial case-insensitive claim was caught wrong by probe before
  commit).
T08 judges noted the only depth-bounded walk example nests one level,
where >= and > behave identically, so readers can't learn which is
right. State the rule: >= is correct at any depth; > ends the walk at
the first direct-child closer (verified: with > the UL walk stops
after the first LI's contents).
…oundary, get_updated_html identity, >= warning placement.

Round-5's two single-trial collapses both trace to unstated boundaries:
a T06 trial attempted tree-aware work in the Tag Processor (whose docs
never say it lacks depth/breadcrumbs), and a T03 trial copied the
next_token() example but guessed '>' because the >= warning only
existed in get_current_depth().

- Tag Processor overview: 'Which processor should I use?' section
  stating it has NO tree awareness and where those methods live;
  HTML Processor overview gets the matching half.
- get_updated_html(): own description at last (was a copy of
  __toString's) — read-your-edits semantics, byte preservation,
  safe mid-scan.
- next_token() example now carries the >= warning inline where the
  failing trial actually read.
…ocessor, >= beside the operator, drain idiom, add_class return semantics.

Round-6 train gaps: the HTML Processor's own get_modifiable_text()
override never stated decoding or that SCRIPT/STYLE/TEXTAREA/TITLE
carry their text on the element token (no #text child) — stated now
with a verified full-parser TITLE example; the >= rule now sits beside
the operator in the get_current_depth() example with the
nested-closer/sibling-text explanation inline; the
paused_at_incomplete_token() example gains the drain-all-tokens idiom
its single-tag example obscured; add_class() return documented as
enqueued-not-applied (false only with no matched tag, verified).
…boundary rule.

Round-7's only functional miss (T05 5/9) sliced multibyte text without
an explicit encoding; the docs say UTF-8 is the only supported input
but never said the output of get_modifiable_text() is UTF-8 nor showed
the mb_* explicit-encoding idiom — stated on both classes now. T08's
recurring boundary confusion appears in break-form code that the
continue-form-only warning misses: stated the equivalence
(break at < depth, never <= depth).
…e last-X bookmark idiom.

T08's recurring failure class is nested walk loops double-advancing
the single cursor: the inner collect-until-close loop exits already
matched on the next region's boundary token, which the outer loop's
next_token() then skips. Document the one-cursor contract on
next_token() with the closer-driven single-pass state-machine shape
(verified DT example), noting it stays reliable on malformed input
because closers are always visited. Also surface the
re-set-the-same-bookmark-name idiom in the overview bookmarks
narrative where T10 trials kept missing it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant